Bladder cancer is one of the most common cancers worldwide, with more than 573,000 new cases in 2020; transitional cell carcinoma (TCC) is the most common primary neoplasm of the urinary bladder. It is a disease that is generally diagnosed early, when it is highly treatable. However, even early-stage bladder cancers can recur after successful treatment.
This work is based on the paper by Guo G. et al. (Guo et al. 2013), in which the researchers performed a genomic analysis of TCC by both whole-genome and whole-exome sequencing, not only to confirm mutations already documented in the literature but also to identify additional genes and pathways that play a key role in this disease. To this end, we performed both a DNA-Seq and an RNA-Seq analysis in order to replicate some of the results reported in the paper.
Regardless of the distinction made in this project between DNA and RNA samples, the first step was to obtain FastQC and MultiQC reports for both cases. FastQC summarises the quality of the reads produced by the sequencing process, which are stored in FASTQ files. Each FASTQ read is composed of four lines: identifier, sequence, a separator line and the per-base sequence quality. Note that we worked with paired-end samples, so each sample produced two FASTQ files and therefore two FastQC reports. Starting from 28 DNA samples and 28 RNA samples, with two reports per sample, the result was a large number of files that would have been difficult to inspect individually. To simplify the analysis, the next step was therefore to merge all the reports (DNA and RNA separately) into a single file for DNA and another for RNA using the multiqc command.
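As a quick illustration of this four-line record structure (toy reads invented for the example; the project's real FASTQ files are gzipped and far larger), the read count of a FASTQ file can be checked from the shell:

```shell
# Build a minimal two-read FASTQ file (hypothetical reads, illustration only)
cat > toy.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
FFFFFFFF
EOF

# Each record spans exactly 4 lines: identifier, sequence, separator, qualities,
# so the number of reads is the line count divided by 4
lines=$(wc -l < toy.fastq)
reads=$((lines / 4))
echo "reads: $reads"
```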
To obtain the FastQC reports, a job array was created to speed up the process; the script used for each sample was as follows:
#!/bin/bash
#SBATCH --job-name=FastQC
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=8
#SBATCH --nodes=1
#SBATCH --mem=3G
#SBATCH --time=00:20:00
##SBATCH --mail-type=END
##SBATCH --mail-user=a905383@alumni.unav.es
#SBATCH -o /scratch/a905383/HPC_A/fasta_files/logs/logs%a.log
export PATH="/scratch/a905383/FastQC:$PATH"
module load Java
file=($(ls /scratch/arubio/RNAseq/*.fastq.gz))
filename=${file[$SLURM_ARRAY_TASK_ID]}
echo $filename
fastqc $filename -o /scratch/arubio/RNAseq/FastqcReports
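For reference, the way the script above picks its file is worth spelling out: SLURM launches one task per array index, and each task selects its file by indexing a bash array with $SLURM_ARRAY_TASK_ID. A minimal local simulation (hypothetical file names; the task ID, normally set by SLURM, is faked by hand):

```shell
# Simulate the per-task file selection used in the job array
files=(sampleA_1.fastq.gz sampleA_2.fastq.gz sampleB_1.fastq.gz)

# In a real array job SLURM sets this variable; here we fake task 2
SLURM_ARRAY_TASK_ID=2
filename=${files[$SLURM_ARRAY_TASK_ID]}
echo "$filename"
```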
Once all the FastQC reports were obtained, the efficiency of the process with the indicated resources was analysed; the results are shown in Figure 1. Both the memory and the CPU efficiency were very low, indicating that the requested resources were excessive; in other words, resources were wasted. The job would therefore have been better dimensioned with less memory and fewer CPUs.
Figure 1: Efficiency of the process to obtain FastQCs.
The next step was to generate one MultiQC report for the DNA samples and another for the RNA samples. To do this, the multiqc command was run directly from the command line and the corresponding files were generated.
After this first step, common to both types of samples, the analysis was split in two: on one hand, the DNA samples were analysed and, on the other, the RNA samples.
Figure 2: Flow diagram of DNA samples processing.
To perform the alignment, the first step was to download the FASTA file of the reference genome, i.e. the file that contains the genomic sequence. It was obtained from the Ensembl website (cite); no Atlas job was needed for this task, since the FASTA was downloaded directly from the command line.
wget ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Once the FASTA file was obtained, the samples were aligned with the BWA-MEM algorithm, which is based on the Burrows-Wheeler transform. The script below was used:
#!/bin/bash
#SBATCH --job-name=bwa%a
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=10
#SBATCH --nodes=1
#SBATCH --mem=64G
#SBATCH --time=04:00:00
##SBATCH --mail-type=END
##SBATCH --mail-user=a905383@alumni.unav.es
#SBATCH -o /scratch/a905383/HPC_A/logs_BWA/bwa%a.log
source /scratch/arubio/bin/moduleloadNGS
file=($(cd /scratch/arubio/DNAseq/ && basename --suffix=_1.fastq.gz -- *_1.fastq.gz))
filename=${file[$SLURM_ARRAY_TASK_ID]}
SAMPLES_DIR=/scratch/arubio/DNAseq
echo $SAMPLES_DIR"/Auxfiles/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
nice bwa mem -M -t 10 -R "@RG\tID:ind1\tSM:ind1\tLB:ind1\tPU:ind1\tPL:Illumina" $SAMPLES_DIR"/Auxfiles/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz" $SAMPLES_DIR"/"$filename"_1.fastq.gz" $SAMPLES_DIR"/"$filename"_2.fastq.gz" | samblaster -M | samtools fixmate - - | samtools sort -O bam -o $SAMPLES_DIR"/resultsBWA/"$filename"_aligned.bam"
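The sample-name extraction at the top of the script can be tried in isolation; the sketch below (dummy SRR names in a throwaway directory) shows how basename --suffix collapses the paired files into one prefix per sample:

```shell
# Recreate the paired-end prefix extraction in a scratch directory
mkdir -p demo_fq && cd demo_fq
touch SRR1_1.fastq.gz SRR1_2.fastq.gz SRR2_1.fastq.gz SRR2_2.fastq.gz

# Listing only *_1.fastq.gz and stripping the suffix yields one name per sample
file=($(basename --suffix=_1.fastq.gz -- *_1.fastq.gz))
echo "${file[@]}"
cd ..
```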
It should be noted that, as the samples were sequenced in paired-end mode, the two FASTQ files generated for each sample in the first section had to be used. As a result, a .bam file was obtained for each sample, i.e. a binary file containing both the read information from the FASTQ files and the mapping information against the genome (chromosome, mapping position, CIGAR…). Once the .bam files were generated, the efficiency of the process was analysed to check whether the requested resources had been appropriate; the results are shown in Figure 3:
Figure 3: Efficiency of the process to obtain .bam files.
After generating the .bam files, we built the index for all the genomic .bam samples stored in the resultsBWA directory. Initially, the Atlas resources requested were 1 GB of memory and a 1-hour job. This failed, as the index needed more memory to be built, so the memory was increased, somewhat arbitrarily, to 100 GB. The job then completed, but analysing the resources used (1 hour and 100 GB) showed that only 4.33 GB were actually consumed, giving a memory efficiency of only 4.33%. On the other hand, the CPU efficiency was 99.06%, which is very good, and the CPU time used was 52 minutes and 49 seconds, within the requested limit. In the future, to build another BWA index with a similar number of samples (approximately 28), it would be advisable to request only around 5 GB of memory and, again, 1 hour of CPU time.
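The memory-efficiency figure is simply memory used over memory requested; with the values from this run it can be reproduced with a one-liner:

```shell
# Memory efficiency = memory actually used / memory requested * 100
used=4.33       # GB actually consumed, as reported for the index job
requested=100   # GB requested in the script
eff=$(awk -v u="$used" -v r="$requested" 'BEGIN { printf "%.2f", u / r * 100 }')
echo "memory efficiency: ${eff}%"
```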
#!/bin/bash
#SBATCH --job-name=index_bai
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --mem=100G
#SBATCH --time=1:00:00
cd /scratch/arubio/bin
./ijob                 # start an interactive job with the resources above
. moduleloadNGS
cd /scratch/arubio/DNAseq/resultsBWA
for SAMPLE in /scratch/arubio/DNAseq/resultsBWA/*.bam
do
samtools index $SAMPLE
done
After obtaining the first .bam files and their corresponding indexes, the duplicates were removed, i.e. each aligned file was scanned and identical reads were discarded. In this way, if a sample contained a mutation that appeared multiple times, it could be checked whether the mutation was genuinely relevant or whether the repetition was an artefact of the amplification step.
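As a toy illustration of the idea (invented coordinates, not real alignments), duplicate reads can be thought of as records that share the same chromosome, position and strand:

```shell
# Toy mapping table: chromosome, position, strand, read name (invented data)
cat > mapped.txt <<'EOF'
chr1 1000 + read1
chr1 1000 + read2
chr2 5000 - read3
EOF

# Count reads per coordinate; any count above 1 flags candidate PCR duplicates
dups=$(awk '{ key = $1" "$2" "$3; n[key]++ }
            END { d = 0; for (k in n) if (n[k] > 1) d += n[k] - 1; print d }' mapped.txt)
echo "duplicate reads: $dups"
```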
To carry out this step, a job array was generated again in order to speed up the process and the .sbs file generated to run the job array was as follows:
#!/bin/bash
#SBATCH --job-name=Duplicates
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=4
#SBATCH -N 1 # On one node
#SBATCH --mem=16G
#SBATCH --time=0-23:0:0
##SBATCH --mail-type=END
##SBATCH --mail-user=a905162@alumni.unav.es
#SBATCH -o /scratch/a905162/ngs/logs/duplicates%a.log
source /scratch/arubio/bin/moduleloadNGS
FILES=($(cd /scratch/arubio/DNAseq/resultsBWA/ && ls *.bam))
filename=${FILES[$SLURM_ARRAY_TASK_ID]}
java -Djava.io.tmpdir=`pwd`/tmp -jar $PICARD MarkDuplicates -I /scratch/arubio/DNAseq/resultsBWA/$filename -O /scratch/arubio/DNAseq/resultsBWA/noduplicantesbam/$filename -M /scratch/arubio/DNAseq/resultsBWA/noduplicantesbam/$(basename $filename .bam)_metrics.txt
When executing the code, we first tried to remove duplicates from a single sample in order to estimate the resources needed for the job array. At this step there were several problems with Java, which reported that there was no space left; after checking that there was in fact enough space for this step, the issue was solved by creating a temporary folder (the -Djava.io.tmpdir option in the command above), which allowed the code to run and produce the new, duplicate-free .bam file.
After processing this single sample, the efficiency of the process was analysed: 11 GB of memory had been used and the CPU efficiency had been 11.77%. Therefore, before submitting the job array, the script was modified to the settings shown in the code, i.e. 4 CPUs and 16 GB of memory.
Once the job array had run, the efficiency was checked; given that the resources had been adjusted, good results were expected. However, as Figure 4 shows, both CPU and memory efficiency were still very low, which means that resources were again wasted.
Figure 4: Efficiency of the process to obtain .bam files without duplicates.
Once the duplicates were removed, the next step was to generate the corresponding VCFs, i.e. the files that store the sequence variants with respect to the reference genome, together with their associated information. The following code was used for this purpose:
#!/bin/bash
#SBATCH --job-name=vcf
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=8
#SBATCH -N 1 # On one node
#SBATCH --mem=8G
#SBATCH --time=0-23:0:0
##SBATCH --mail-type=END
##SBATCH --mail-user=a905162@alumni.unav.es
#SBATCH -o /scratch/a905162/ngs/logs/vcf%a.log
source /scratch/arubio/bin/moduleloadNGS
FILES=($(cd /scratch/arubio/DNAseq/resultsBWA/noduplicantesbam/ && ls *.bam))
filename=${FILES[$SLURM_ARRAY_TASK_ID]}
gatk HaplotypeCaller --native-pair-hmm-threads 8 -I /scratch/arubio/DNAseq/resultsBWA/noduplicantesbam/$filename -R /scratch/arubio/DNAseq/Auxfiles/dummy.fa -O /scratch/arubio/DNAseq/vcf/$filename.vcf
In the same way as before, a job array was programmed, but before executing it the VCF of a single sample was generated to check that the code worked and that the requested resources were adequate. The result was a CPU efficiency of 67.96% and a memory efficiency of 3.73%. In view of these results, the script was modified to reduce the waste of resources as much as possible, settling on the values shown in the code, i.e. 8 CPUs and 8 GB of memory.
After many hours of job-array execution, the results were found to be correct, and the efficiency of the process is shown in Figure 5. As before, high efficiency percentages were expected; for memory this was not the case, with a result of 22.13%. The CPU efficiency, however, was 86%, which indicates that selecting 8 cores per task was appropriate.
Figure 5: Efficiency of the process to generate vcfs.
After computing the VCFs, the next step was to obtain a report for each of them, i.e. a table per sample summarising what had been analysed in the VCF. The following code was used for this purpose:
#!/bin/bash
#SBATCH --job-name=vcf1_report
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=2
#SBATCH -N 1 # On one node
#SBATCH --mem=1G
#SBATCH --time=0-23:0:0
##SBATCH --mail-type=END
##SBATCH --mail-user=a905162@alumni.unav.es
#SBATCH -o /scratch/a905162/ngs/logs/report1_%a.log
source /scratch/arubio/bin/moduleloadNGS
FILES=($(cd /scratch/arubio/DNAseq/resultsBWA/noduplicantesbam/ && ls *.bam))
filename=${FILES[$SLURM_ARRAY_TASK_ID]}
gatk BaseRecalibrator -R /scratch/arubio/DNAseq/Auxfiles/dummy.fa -I /scratch/arubio/DNAseq/resultsBWA/noduplicantesbam/$filename --known-sites /scratch/arubio/DNAseq/vcf/$filename.vcf -O /scratch/arubio/DNAseq/vcf/reports1/$(basename $filename )_report_1.table
Following the same procedure as before, the code was first tested on a single sample to check that the result was correct and the resources reasonable. In this test the CPU and memory efficiencies were 12.44% and 1.78% respectively, so the resources were adjusted and it was decided that 2 CPUs per task and 1 GB of memory were sufficient.
After checking that the code worked and the output was correct, the job array was run to obtain the reports for all the samples; the efficiency of the process is shown in Figure 6. The efficiencies were lower than expected, so it would have been even more optimal to reduce the resources further.
Figure 6: Efficiency of the process to obtain the reports of the vcfs.
Once the first VCFs and their reports were obtained, the .bam files were recalibrated, i.e. their quality scores were adjusted. This step is necessary because the assigned qualities are per-base error estimates issued by the sequencing machine, which means the base quality scores may be over- or underestimated. Base quality score recalibration (BQSR) models these empirical errors and adjusts the quality scores, making them more accurate.
For this purpose, the following code has been generated:
#!/bin/bash
#SBATCH --job-name=recalibration
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=2
#SBATCH -N 1
#SBATCH --mem=1G
#SBATCH --time=0-23:0:0
#SBATCH --mail-type=END
#SBATCH --mail-user=a905162@alumni.unav.es
#SBATCH -o /scratch/a905162/ngs/logs/rec_%a.log
source /scratch/arubio/bin/moduleloadNGS
FILES=($(cd /scratch/arubio/DNAseq/resultsBWA/noduplicantesbam/ && ls *_aligned.bam))
filename=${FILES[$SLURM_ARRAY_TASK_ID]}
gatk ApplyBQSR -I /scratch/arubio/DNAseq/resultsBWA/noduplicantesbam/$filename --bqsr /scratch/arubio/DNAseq/vcf/reports1/$filename"_report_1.table" -O /scratch/a905162/ngs/recalibration2/$filename"_recalibrated.bam"
Initially, we tried to recalibrate a single sample, but multiple errors prevented us from reaching a result. Analysing the errors showed that the problem was a lack of space in the scratch area, so we first tried moving files to our personal scratch; however, that space turned out to be shared by the whole group, so moving the files achieved nothing. The next step was therefore to move files that were no longer needed to the personal dipc folder to free up space in the scratch so that the recalibrated files could be generated. After this, a single sample was run and, as Figure 7 shows, a result was obtained, although the efficiencies were not as desired.
Figure 7: Efficiency of the process to obtain the recalibrated .bam files.
Since a single sample could be recalibrated, we tried to submit a job array with the rest of the samples, but the lack of space prevented any attempt from finishing. We then started generating the files manually, one by one; given that this was inefficient and that the disk eventually ran out of space, we decided to stop the process here and move directly to the next section.
After deciding not to continue with the recalibration, and therefore not obtaining the final VCFs, we continued the study using the VCFs generated in section 2.3. Specifically, the next step consisted of using the Ensembl Variant Effect Predictor (VEP) to determine the effect of the variants obtained on genes, transcripts, the protein sequence and regulatory regions (https://www.ensembl.org/info/docs/tools/vep/index.html). We used the web interface, which gives access to the main VEP functions without the need to use the command line (https://pubmed.ncbi.nlm.nih.gov/20562413/).
When submitting the samples it was necessary to check the size of each file, because Ensembl only accepts files smaller than 50 MB; the samples exceeding this size were therefore filtered out. In addition, since this process could not be parallelised, each student uploaded two of the remaining .vcf files for annotation. As the website is free of charge, the speed was not outstanding, and each sample took around five hours to be annotated. The distribution of samples is shown in the table below.
Figure 8: Samples uploaded to Ensembl.
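The 50 MB screening itself is a one-line check; the sketch below uses a dummy file, and the limit expressed in bytes is our assumption of how Ensembl measures size:

```shell
# Hypothetical check: keep only VCFs under the 50 MB upload limit
mkdir -p demo_vcf
head -c 1024 /dev/zero > demo_vcf/small.vcf   # 1 KiB dummy file

limit=$((50 * 1024 * 1024))                   # 50 MB expressed in bytes
size=$(wc -c < demo_vcf/small.vcf)
if [ "$size" -lt "$limit" ]; then ok=yes; else ok=no; fi
echo "uploadable: $ok"
```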
After the annotation process, the generated files were copied to Atlas, at which point it became clear that they needed to be modified, as the line-ending control characters were not adequate. There are two main control characters for line endings, and operating systems differ in which they use. Since the generated files contained both CR (carriage return) and LF (line feed), while Linux expects LF only, the files had to be converted before they could be used on Linux.
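The required conversion can be reproduced with tr, which is one standard way (equivalent to what dos2unix does) to strip the CR bytes:

```shell
# Create a small file with Windows-style CRLF line endings
printf 'line1\r\nline2\r\n' > crlf.vcf

# Strip the carriage returns so only LF remains
tr -d '\r' < crlf.vcf > lf.vcf

# No CR bytes should survive the conversion
crs=$(tr -cd '\r' < lf.vcf | wc -c)
echo "remaining CR bytes: $crs"
```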
Therefore, the files were converted, and the annotated files were then transformed into .maf files. For this, Perl was used, specifically the vcf2maf script. Note that when creating a .maf file only the tumour samples are used, as the aim is to analyse the mutations in the genes. The first step was therefore to determine the type of each sample, distinguishing tumour from healthy samples. Since vcf2maf requires both the tumour and the healthy sample IDs, the following table shows the result of this analysis, where each tumour sample is listed with its corresponding healthy sample.
Figure 9: Tumour sample ID with its corresponding healthy sample ID.
With this information, vcf2maf was executed and, as a result, a .maf file was obtained for each tumour sample listed in the table of Figure 9. In other words, the file storing the detected somatic variants was obtained, with the aim of finally studying the mutations observed in the tumour samples.
#!/bin/bash
#SBATCH --job-name=vcf2maf
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=1
#SBATCH -N 1
#SBATCH --mem=2G
#SBATCH --time=0-23:0:0
#SBATCH --mail-type=END
#SBATCH --mail-user=a905162@alumni.unav.es
#SBATCH -o /scratch/a905162/ngs/logs/vcf2maf_3_%a.log
source /scratch/arubio/bin/moduleloadNGS
files=($(cd /scratch/arubio/DNAseq/vcf/ && ls *_sinCR.vcf))
filename=${files[3]}
perl /scratch/arubio/bin/mskcc-vcf2maf-754d68a/vcf2maf.pl --input-vcf /scratch/arubio/DNAseq/vcf/$filename --output-maf /scratch/arubio/DNAseq/vcf/$(basename $filename _anotated_sinCR.vcf).maf --inhibit-vep --ref-fasta /scratch/arubio/.vep/homo_sapiens/102_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --ncbi-build GRCh38 --tumor-id SRR645235 --normal-id SRR645236
Note that, unlike in other sections, a job array was not used here because, as can be seen at the end of the script, both the tumour and the normal sample names must be specified, which would have complicated the array. To keep the process simple and correct, individual jobs were submitted, editing the sample names by hand for each analysis. Once the corresponding files were obtained, the efficiency was studied; it is shown in Figure 10.
Figure 10. Efficiency of the process to obtain .maf files.
Analysing the results, it can be seen that the efficiencies are reasonably high, so the resources used were adequate.
Once the .maf files had been obtained, they were compressed with the gzip command and each file was downloaded for further analysis in R.
With the 8 .maf files downloaded, they were merged into a single MAF object in order to study the mutations jointly.
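The compression step round-trips losslessly, which can be sanity-checked with a dummy file (hypothetical MAF header and gene name):

```shell
# Create a dummy MAF and compress it, keeping the original for comparison
printf 'Hugo_Symbol\tChromosome\nGENE1\t1\n' > demo.maf
gzip -c demo.maf > demo.maf.gz    # -c writes to stdout, original is untouched

# The decompressed content must match the original byte for byte
gzip -dc demo.maf.gz | cmp -s - demo.maf && match=yes || match=no
echo "round-trip ok: $match"
```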
setwd("/Users/laura/Desktop/TECNUN/Máster/2º Semestre/Bioinformatics and Next Generation Sequencing/Final work")
samples <- dir(pattern = "^SR")
# Name of the files in samples
my_sample <- samples[1]
cat("Processing sample:", my_sample,"\n")
maf <- data.table::fread(file = my_sample, sep = "\t", stringsAsFactors = FALSE,
verbose = FALSE, data.table = TRUE, showProgress = TRUE,
header = TRUE, fill = TRUE, skip = "Hugo_Symbol",
quote = "")
maffinal <- maf[, c(-3, -4, -8, -14, -15, -20, -21, -22, -23, -24, -25,
                    -26, -27, -28, -29, -30, -31, -32, -33, -34, -35,
                    -36, -37, -40, -41, -42, -43, -44, -45, -46, -57,
                    -58, -65, -66, -67, -68, -69, -70, -71, -76, -78,
                    -79, -80, -81, -82, -83, -84, -85, -94, -95, -97,
                    -99, -100, -101, -103, -105, -106, -107, -108,
                    -109, -110, -111, -112, -113)]
for (my_sample in samples[-1]) {
cat("Processing sample:", my_sample,"\n")
maf <- data.table::fread(file = my_sample, sep = "\t", stringsAsFactors = FALSE,
verbose = FALSE, data.table = TRUE, showProgress = TRUE,
header = TRUE, fill = TRUE, skip = "Hugo_Symbol",
quote = "")
maf <- maf[, c(-3, -4, -8, -14, -15, -20, -21, -22, -23, -24, -25,
               -26, -27, -28, -29, -30, -31, -32, -33, -34, -35,
               -36, -37, -40, -41, -42, -43, -44, -45, -46, -57,
               -58, -65, -66, -67, -68, -69, -70, -71, -76, -78,
               -79, -80, -81, -82, -83, -84, -85, -94, -95, -97,
               -99, -100, -101, -103, -105, -106, -107, -108,
               -109, -110, -111, -112, -113)]
# Fields not needed removed. Still many more can be removed
maffinal <- rbind(maffinal,maf)
}
# Convert into a MAF object (read.maf comes from the maftools package)
library(maftools)
rm(maf); gc() # Remove no longer used objects and garbage collection
NGS <- read.maf(maffinal)
rm(maffinal); gc() # Remove no longer used objects and garbage collection
Once the MAF object had been obtained, a series of plots was generated in order to visualise the results obtained in the previous sections. First, the summary was obtained, resulting in Figure 11.
plotmafSummary(NGS)
Figure 11. MAF summary.
This first image shows a summary of the mutations observed across the samples, where missense mutations clearly stand out.
Next, we proceeded to generate the Oncoplot to see the most mutated genes together with the number of mutations in each sample, resulting in Figure 12.
oncoplot(NGS)
Figure 12. Oncoplot.
The image shows the most mutated genes, indicating the number of mutations in each of the samples and the type of mutation. Specifically, it can be seen that missense mutations stand out and that the number of mutations is high.
We then generated lollipop plots to analyse the mutations of a few genes in particular: ESPL1, FGFR3, STAG2 and TACC3, the most representative genes of the study being replicated. When plotting, a result could only be obtained for ESPL1; in the remaining cases the output indicated that the genes had no mutations. Only one lollipop plot could therefore be obtained, and it is shown in Figure 13.
lollipopPlot(NGS, "ESPL1", AACol = "Protein_position")
Figure 13. ESPL1 gene lollipop plot.
This plot is based on amino acid changes, which implies that the represented mutations must be located in exons. Specifically, Figure 13 shows that the ESPL1 gene contains missense mutations.
Given that no plots could be obtained for most of the analysed genes, we wanted to find out why. First, we checked that the generated files were correct, as the error was initially attributed to a failure in our procedure. However, comparing the number of mutations between the VCF and MAF files showed that the counts were identical, so the process followed was correct. This indicated that the absence of mutations in the lollipop plots was due to the type of mutations, so an in-depth analysis of the mutation types was performed.
To begin this analysis, we first looked at the total number of mutations in each of the genes analysed and these results have been presented in the table of Figure 14.
Figure 14. Total mutations in each gene.
Analysing the results, it can be seen that there are many mutations in each gene, so the next step was to analyse their type. The result of this analysis is shown in the table of Figure 15, which lists the types and numbers of mutations found in each gene.
Figure 15. Mutation types in each gene.
In order to understand the mutation types listed in the table, each one is described below:
5’ flanking region: the region of DNA adjacent to the 5’ end of the gene. It contains the promoter and may contain enhancers or other protein binding sites; it is not transcribed into RNA.
Missense mutation: a DNA change that results in a different amino acid being encoded at a particular position in the resulting protein. Some missense mutations alter the function of the resulting protein.
Flanking region: a region of DNA which is not copied into the mature mRNA, but which is present adjacent to the 3’/5’ end of the gene.
Silent mutation: a change in the nucleotide sequence of the DNA without a subsequent change in the amino acid or in the function of the overall protein.
Splice-site mutation: a genetic alteration in the DNA sequence occurring at the boundary of an exon and an intron. This change can disrupt RNA splicing, resulting in the loss of exons or the inclusion of introns and an altered protein-coding sequence.
Untranslated region (UTR): a section of the messenger RNA that is not translated into protein. Several regions of the mRNA molecule are not translated, including the 5’ cap, the 5’ untranslated region, the 3’ untranslated region and the poly(A) tail.
Taking into account the mutation types described and the fact that a lollipop plot only represents mutations located in exons, it is coherent that only the ESPL1 plot could be obtained. However, considering the study on which this analysis is based, more mutations were expected, especially in STAG2. Although STAG2 is indeed the gene with the highest number of mutations here, only 8 samples were used, so it is consistent that the mutations reported in the paper could not all be recovered in this case.
In this step, the FASTQ files of all patients in the study were pseudoaligned against the reference human transcriptome with kallisto, the objective being to obtain transcript abundance estimates for each sample. The script used is shown below:
#!/bin/bash
#SBATCH --job-name=Kallisto
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=1
#SBATCH -N 1 # On one node
#SBATCH --mem=4G
#SBATCH --time=0-23:0:0
##SBATCH --mail-type=END
##SBATCH --mail-user=a905162@alumni.unav.es
#SBATCH -o /scratch/a905162/ngs/logs/kallisto%a.log
source /scratch/arubio/bin/moduleloadNGS
FILES=($(cd /scratch/arubio/RNAseq/ && basename --suffix=_1.fastq.gz -- *_1.fastq.gz))
filename=${FILES[$SLURM_ARRAY_TASK_ID]}
mkdir /scratch/a905162/ngs/KallistoMaps/$filename
chmod 770 /scratch/a905162/ngs/KallistoMaps/$filename
kallisto quant -i /scratch/arubio/GenomicFiles/kallisto_Assignment_index.idx -o /scratch/a905162/ngs/KallistoMaps/$filename -b 25 -t $SLURM_CPUS_PER_TASK /scratch/arubio/RNAseq/${filename}_1.fastq.gz /scratch/arubio/RNAseq/${filename}_2.fastq.gz
This script was run on Atlas in order to parallelise the job. At first, 1 node and 16 GB of RAM were requested, but the job efficiency was 23.12%. To make better use of the Atlas resources, the job was resubmitted with only 4 GB of RAM; the efficiency finally reached is shown in Figure 16.
Figure 16. Efficiency of the process to do Kallisto.
To process the RNA-Seq data, to begin with, the required libraries had to be loaded.
library(readr)
library(sleuth)
library(biomaRt)
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
library(gplots)
library(ggrepel)
library(EnsDb.Hsapiens.v79)
The sample metadata were downloaded from ADI, and the column names had to be changed in order to correctly identify each variable. It was also important to split the sample_name column into two variables: sample_ID (the subject IDs) and sample_type (Normal or Cancer).
samples <- read_csv("samples_finalwork.csv",col_names = FALSE)
# View(samples)
colnames(samples) <- c("ID", "library_strategy", "base_pair_fragment", "Bioproject", "Biosample", "Accession", "X7", "library_selection", "library_source", "library_name", "end_date", "X12", "X13", "Run", "SRS", "SRA_study", "sample_name", "read_length", "center_name", "data_type", "sra", "ncbi", "library_instrument", "library_layout", "organism", "seq_type", "Published")
samples$sample_ID <- sapply(strsplit(samples$sample_name, "\\.|_"), function(X) X[1]) # select the part before the "_"
samples$sample_type <- sapply(strsplit(samples$sample_name, "\\.|_"), function(X) X[2]) # select the part after the "_"
Finally, in order to relate each transcript to its corresponding gene, the following chunk of code downloads annotation from Ensembl (cite): the gene ID, the transcript ID, the chromosome name, and both the start and end of the transcript. Only the first two variables will actually be used.
# Annotation
# Get all the transcripts, genes, go_id and domains
mart <- biomaRt::useMart(host = 'may2021.archive.ensembl.org',
biomart = "ensembl",
dataset = "hsapiens_gene_ensembl") # hsapiens for human
Atributos <- listAttributes(mart)
GeneTransciptLocation <- getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id","chromosome_name","transcript_start","transcript_end"),mart=mart)
Once the kallisto output had been obtained, the information was merged into a matrix containing the TPM values of each transcript per sample.
base_dir <- "/home/osboxes/Desktop/NGSFolder/FinalWork_NGS_HPC/FinalWork_NGS_HPC/samples" #base_dir = directory where the output of kallisto is stored
sample_id <- dir(base_dir) #sample_id = the names of the folders.
dirsample<-(paste0(base_dir,"/",sample_id[1]))
dirtoload <- paste0(dirsample,"/","abundance.tsv")
RNASeq <- read.delim(dirtoload,sep = "\t", colClasses = c(NA,"NULL","NULL","NULL",NA))
for (n in 2:length(sample_id)){
dirsample<-(paste0(base_dir,"/",sample_id[n]))
dirtoload <- paste0(dirsample,"/","abundance.tsv")
RNASeq[,n+1] <- read.delim(dirtoload,sep = '\t', colClasses = c('NULL','NULL','NULL','NULL',NA))
}
rownames(RNASeq) <- sapply(strsplit(as.character(RNASeq[,1]),"\\|"),function(X) return(X[1]))
RNASeq<-RNASeq[,-1]
RNASeq2<-as.matrix(RNASeq)
colnames(RNASeq2)<-sample_id
Next, we explain the design and contrast matrices that have been built. The variable under study is the sample type, i.e. the difference in gene expression between normal and malignant samples, taking the normal samples as reference.
# we build the design matrix
model_matrix <- model.matrix(~ sample_ID + sample_type, data = samples)
As we will study the sample type variable, we just need to build a contrast matrix for it. Additionally, we need to define the betas and the null hypothesis. First of all, let’s establish the other variables in the design matrix. Columns two to sixteen represent the different patients, since the study is paired. The last column refers to the variable that we want to study, that is, the sample type. Now, we can define the null hypothesis and build the contrast matrix.
We define \(\beta_0\) as the mean expression of the transcripts in the normal sample of patient B18. Then, \(\beta_0 + \beta_1\) is the mean expression of the normal sample of patient B23, and so on until \(\beta_0 + \beta_{15}\), which is the mean isoform expression of the normal sample of patient B77. However, we are not interested in the differences in gene expression among patients. Since the contrast that we want to make concerns the difference between normal and malignant samples, the null hypothesis is that there is no significant difference between the two, that is, \(\beta_{16} = 0\).
# we create the contrast matrix
cont.matrix <- c(rep(0, ncol(model_matrix)-1), 1)
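A toy paired design with two patients illustrates what this contrast picks out (a sketch; the patient and type labels are invented but follow the format of the `samples` data frame used above):

```r
# Two patients, each with one Normal and one Cancer sample (paired design)
toy <- data.frame(sample_ID   = gl(2, 2, labels = c("B18", "B23")),
                  sample_type = gl(2, 1, 4, labels = c("Normal", "Cancer")))

# Design matrix: intercept, patient effect, and the sample_type effect last
mm <- model.matrix(~ sample_ID + sample_type, data = toy)

# The contrast vector is zero everywhere except the last coefficient,
# so it tests exactly the Normal-vs-Cancer effect
cm <- c(rep(0, ncol(mm) - 1), 1)
colnames(mm)[which(cm == 1)]  # "sample_typeCancer"
```

Because the sample type column is always last in this formula, the same `c(rep(0, ncol(model_matrix)-1), 1)` construction works regardless of the number of patients.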
After having run kallisto for each of the samples, a differential expression analysis was carried out using sleuth. Sleuth is a program for the analysis of RNA-Seq experiments whose transcript abundances have been quantified with kallisto (cite). First of all, we need to create a tab-delimited metadata file for the experiment so that we have, for each sample, its ID (header “sample”), the patient ID (header “sample_ID”), the sample type, that is, whether the sample is normal or malignant (header “sample_type”), and the file path (header “path”) to the kallisto output.
samples_red <- samples[which(samples$Run %in% sample_id),]
samples_red <- samples_red[match(sample_id, samples_red$Run),]
#With the next line we obtain the complete path
kal_dirs <- sapply(sample_id, function(id) file.path(base_dir,id))
#Build a data.frame with the sample ID, patient ID, condition (normal/cancer) and the kallisto directories for each sample.
s2c <- data.frame(sample = sample_id,
sample_ID = samples_red$sample_ID,
sample_type = samples_red$sample_type)
#check that it is ordered
if (!identical(as.vector(s2c$sample),names(kal_dirs))){
iix <- match(s2c$sample,names(kal_dirs))
kal_dirs<-kal_dirs[iix]
}
s2c <- dplyr::select(s2c, sample = sample, sample_ID = sample_ID, sample_type = sample_type)
s2c <- dplyr::mutate(s2c, path = kal_dirs)
#Assign the condition value to the intercept
s2c$sample_type <- relevel(as.factor(s2c$sample_type), "Normal")
s2c_ordered <- s2c[with(s2c, order(sample_type)),]
Then, we follow a series of steps in order to obtain the final table. First, we load the kallisto-processed data into a sleuth object using the sleuth_prep function. Afterwards, we estimate the parameters of both the full and the reduced sleuth models and finally perform the differential analysis using the likelihood ratio test (LRT). However, as this method does not report the fold change, we also need to run the Wald test (WT), even though the LRT is generally considered the more reliable of the two. For most of the analyses and plots in this project the LRT output will be used, employing the table generated with the WT only to draw the volcano plots.
# Load the kallisto processed data into the object
s1 <- sleuth_prep(s2c_ordered, extra_bootstrap_summary=T)
# Estimate parameters for the sleuth response error measurement (full) model
so <- sleuth_fit(s1, ~ sample_ID + sample_type, fit_name = 'full')
# Estimate parameters for the sleuth reduced model
so <- sleuth_fit(so, ~ sample_ID, fit_name = 'reduced')
# Perform differential analysis (testing)
so <- sleuth_lrt(so,'reduced', 'full')
# Statistical Analysis result
sleuth_table_lrt <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
sleuth_significant_lrt <- dplyr::filter(sleuth_table_lrt, qval <= 0.001) # we lowered the q-value cutoff from 0.05 to 0.001 because with 0.05 there were ~45,000 significant transcripts; the 0.001 value was tuned iteratively
# We also perform the WT analysis
so <- sleuth_wt(so, paste0('sample_typeCancer'))
sleuth_table_wt <- sleuth_results(so, paste0('sample_typeCancer'))
The enrichGO function of the ClusterProfiler R package was used to perform the Gene Ontology (GO) enrichment analysis for biological processes [cite]. The genes related to the gene transcripts with FDR < 0.001 were selected. The gene universe consisted of the more than 60,000 genes that were identified in Ensembl [cite]. Moreover, in all the cases, we selected the GO terms with FDR < 0.05 and, among them, we used a cut-off value of 0.7 to reduce redundant GO terms, which means that out of the terms that have a similarity higher than 0.7 (redundant terms), only one of them will be selected (representative term). Based on the FDR value, the top 20 GO terms were selected as the most enriched processes.
sig_transcripts <- sapply(strsplit(sleuth_significant_lrt$target_id,"\\."),function(X) return(X[1]))
iij <- which(GeneTransciptLocation$ensembl_transcript_id %in% sig_transcripts)
sig_genes <- GeneTransciptLocation$ensembl_gene_id[iij]
universe <- unique(GeneTransciptLocation$ensembl_gene_id)
enrichGO <- clusterProfiler::enrichGO(gene = sig_genes, OrgDb = "org.Hs.eg.db",
keyType = 'ENSEMBL', ont = 'BP', pvalueCutoff = 0.01,
pAdjustMethod = 'fdr', universe = universe,
qvalueCutoff = 0.05, minGSSize = 5, maxGSSize = 600,
readable = FALSE, pool = FALSE)
database_enrichgo <- data.frame(enrichGO)
# We use simplify to remove redundant GO terms
enrichGO@result <- enrichGO@result[which(enrichGO@result$p.adjust < 0.05),]
enrichGO_simplified <- simplify(enrichGO, cutoff=0.7, by="p.adjust", select_fun=min)
database_enrichGO_simplified <- data.frame(enrichGO_simplified)
As our project is based on a scientific article published in 2013, we decided to compare some of the genes obtained in our analysis with those reported by the researchers. To do that, we first had to convert the gene IDs to gene symbols so that they matched the names that appear in the paper. We were most interested in comparing the genes annotated to the chromosome segregation term of the GO enrichment analysis (GO:0007059) with the genes that the paper relates to the same function (STAG2 and ESPL1). As a result, both of them were included in our results, which is consistent with the paper's findings.
sig_genes_symbol_ens <- ensembldb::select(EnsDb.Hsapiens.v79, keys = sig_genes, keytype = "GENEID", columns = c("SYMBOL","GENEID"))
sig_genes_symbol <- sig_genes_symbol_ens[,1]
genes_paper_RNA <- c("STAG2","ESPL1","FGFR3","TACC3") # genes highlighted in the paper's abstract; STAG2 and ESPL1 are related to sister chromatid cohesion and segregation
genes_conocidos <- c("TP53", "HRAS", "FGFR3", "PIK3CA", "RB1", "KRAS", "TSC1")
chrom_rem_genes <- c("UTX", "ARID1A", "MLL-MLL3", "CREBBP-EP300", "NCOR1", "CHD6")
database_enrichGO_simplified["GO:0007059","geneID"] # the forward slashes must be removed to obtain a vector with the gene names
# For the variant calling there should be mutations in the newly reported genes: UTX, MLL-MLL3, CREBBP-EP300, NCOR1, ARID1A, CHD6
# Other genes already known to be mutated in bladder cancer: TP53, HRAS, FGFR3, PIK3CA, RB1, KRAS, TSC1
which(genes_conocidos %in% sig_genes_symbol) # KRAS and HRAS
which(chrom_rem_genes %in% sig_genes_symbol) #NCOR1
## We still need to compare against the chromosome segregation gene list
database_enrichGO_simp_genes <- data.frame(database_enrichGO_simplified$geneID)
rownames(database_enrichGO_simp_genes) <- rownames(database_enrichGO_simplified)
genes_go_chrom_seg <- strsplit(database_enrichGO_simp_genes["GO:0007059",],"\\/")[[1]]
genes_go_chrom_seg_symbol_ens <- ensembldb::select(EnsDb.Hsapiens.v79, keys = genes_go_chrom_seg, keytype = "GENEID", columns = c("SYMBOL","GENEID"))
genes_go_chrom_seg_symbol <- genes_go_chrom_seg_symbol_ens[,1]
table(genes_paper_RNA %in% genes_go_chrom_seg_symbol)
which(genes_paper_RNA %in% genes_go_chrom_seg_symbol) # STAG2, ESPL1 and TACC3
We also expected to find the FGFR3 gene, but we realized that the researchers of the scientific article analyzed gene fusions, which cannot be detected with the differential expression analysis that has been carried out. For this reason, another approach was taken. We used the following command to download the .bam and .bai files of cancer sample SRR645259 from subject B59, and opened them in the IGV software. Since we expected a gene fusion, read pairs with one mate in FGFR3 and the other in TACC3 should have been visible. However, as can be seen in the image below (insert IGV screenshot), the expected pattern was not observed. For this reason, it was decided to process the raw data of this sample with a different method.
scp a905208@atlas-fdr.sw.ehu.es:/scratch/arubio/DNAseq/resultsBWA/noduplicatesbam/SRR645259_aligned.bam* E:/NGSFolder/FinalWork_NGS_HPC/FinalWork_NGS_HPC/IGV_sample
As an alternative, the sample has been realigned with STAR.
First of all, the following .sh file has been created in order to get the required index.
#!/bin/bash
#SBATCH --account=biomedicine
#SBATCH --partition=biomedicine
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --mem=1G
#SBATCH --time=01:00:00
cd /scratch/arubio/bin
. moduleloadNGS
cd /scratch/a905208/FinalWork/STAR/index
STAR --runThreadN 20 --runMode genomeGenerate --genomeDir /scratch/a905208/FinalWork/STAR/index --genomeFastaFiles /scratch/a905208/FinalWork/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /scratch/a905208/FinalWork/Homo_sapiens.GRCh38.106.gtf --sjdbOverhang 89
For this, it has been necessary to download the Homo_sapiens.GRCh38.106.gtf.gz file from Ensembl, since the one that was already available (gencode.v24.annotation.sorted.gtf) was not compatible with the FASTA file.
wget ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz
Once finished, the following two files (S1_R1.fq and S1_R2.fq) have been created, each containing the first 1,000,000 lines (i.e., 250,000 reads) of the corresponding FASTQ file.
cat SRR645259_1.fastq | head -n 1000000 > S1_R1.fq
cat SRR645259_2.fastq | head -n 1000000 > S1_R2.fq
Then, the following chunk of code has been executed from the command line.
STAR --runThreadN 3 --genomeDir /scratch/a905208/FinalWork/STAR/index --readFilesIn S1_R1.fq S1_R2.fq --outFileNamePrefix S1_test --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes NH HI AS NM MD
Finally, to generate the .bai file from the .bam file that has been generated in the previous step, the following code has been run.
samtools index S1_testAligned.sortedByCoord.out.bam
To separate all the samples into clusters based on the results of the differential gene expression analysis, we first performed a PCA on the 8198 differentially expressed transcripts (FDR < 0.001). The correct separation between the sample groups confirms that the selected transcripts can be considered adequate to stratify the samples into the two groups. Moreover, the axes are ranked in order of importance: differences along the first principal component (PC1) are more important than differences along the second (PC2).
rownames(RNASeq2) <- sapply(strsplit(rownames(RNASeq2), "\\."), function(X) X[1])
RNASeq2_filtered <- RNASeq2[sig_transcripts,]
col.order <- s2c_ordered$sample
RNASeq2_filt_ord <- RNASeq2_filtered[, col.order]
PCs <- prcomp(t(log2(1+RNASeq2_filt_ord)), center = T, retx = T)
data_PCA <- data.frame(samples = rownames(PCs$x),
ID = s2c_ordered$sample_ID,
type = s2c_ordered$sample_type,
PC1 = PCs$x[,1],
PC2 = PCs$x[,2])
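The `data_PCA` data frame built above can then be plotted with PC1 and PC2 as axes and the sample type as colour. A minimal base-graphics sketch is shown below on a toy expression matrix standing in for the filtered TPM matrix (ggplot2 on `data_PCA` would work equally well; all names and values here are invented for illustration):

```r
# Toy matrix: 50 transcripts x 6 samples, with the 3 "tumour" columns scaled
# up so the two groups separate along PC1
set.seed(1)
expr <- matrix(rexp(50 * 6, rate = 0.2), nrow = 50,
               dimnames = list(paste0("tx", 1:50), paste0("s", 1:6)))
expr[, 4:6] <- expr[, 4:6] * 4

# Same transformation as in the real analysis: PCA on log2(1 + TPM)
PCs_toy <- prcomp(t(log2(1 + expr)), center = TRUE, retx = TRUE)

type <- rep(c("Normal", "Tumor"), each = 3)
plot(PCs_toy$x[, 1], PCs_toy$x[, 2],
     col = ifelse(type == "Normal", "blue", "red"), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA of samples")
legend("topright", legend = c("Normal", "Tumor"),
       col = c("blue", "red"), pch = 19)
```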
In order to visualize more clearly the expression pattern of the most significant transcripts, a heatmap has been plotted. In addition to showing both the sample and transcript IDs in the columns and rows respectively, they have been distributed using a hierarchical clustering algorithm, so that the samples or transcripts that show a more similar behavior are closer to each other.
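A sketch of such a heatmap with base R's `heatmap()` is shown below on a toy matrix standing in for the log-TPM matrix of the significant transcripts (packages such as pheatmap or ComplexHeatmap produce nicer output; the data here are invented for illustration). `heatmap()` applies hierarchical clustering to both rows and columns by default, which is the behaviour described above.

```r
# Toy matrix: 20 transcripts x 6 samples; the first 10 transcripts are
# over-expressed in the "tumour" columns so the clustering has structure
set.seed(2)
expr <- matrix(rexp(20 * 6, rate = 0.2), nrow = 20,
               dimnames = list(paste0("tx", 1:20),
                               c("N1", "N2", "N3", "T1", "T2", "T3")))
expr[1:10, 4:6] <- expr[1:10, 4:6] * 5
logexpr <- log2(1 + expr)

# Rows are scaled so the colour reflects relative expression per transcript;
# hierarchical clustering reorders rows and columns by similarity
heatmap(logexpr, scale = "row", main = "Top significant transcripts")
```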
Next, we represent the gene expression values, measured in TPM, of the top significant transcripts. To select the 10 most significant transcripts, those with the lowest FDR values were chosen.
sig_transcripts_sample <- data.frame(tran_name_1 = sapply(strsplit(sleuth_significant_lrt$target_id[1:10],"\\."),function(X) return(X[1])),
tran_name_2 = sapply(strsplit(sleuth_significant_lrt$target_id[1:10],"\\."),function(X) return(X[2])))
iij_sample <- which(GeneTransciptLocation$ensembl_transcript_id %in% sig_transcripts_sample$tran_name_1)
sig_genes_sample <- GeneTransciptLocation$ensembl_gene_id[iij_sample]
sig_trans_genes_sample <- GeneTransciptLocation[iij_sample, c(1,2)]
colnames(sig_transcripts_sample)[1] <- "ensembl_transcript_id"
sig_df <- merge(sig_transcripts_sample, sig_trans_genes_sample, by = c("ensembl_transcript_id"))
sig_genes_symbol_ens_sample <- ensembldb::select(EnsDb.Hsapiens.v79, keys = sig_genes_sample, keytype = "GENEID", columns = c("SYMBOL","GENEID"))
colnames(sig_genes_symbol_ens_sample)[2] <- colnames(sig_df)[3]
sig_df <- merge(sig_df, sig_genes_symbol_ens_sample, by = c("ensembl_gene_id"))
sig_df$SYMBOL_transcript_id <- paste0(sig_df$SYMBOL, "_", sig_df$tran_name_2)
RNASeq2_filt_ord_sample <- log2(1+RNASeq2_filt_ord[sig_transcripts_sample$ensembl_transcript_id,])
sig_df_bp <- sig_df[match(sig_transcripts_sample$ensembl_transcript_id, sig_df$ensembl_transcript_id),]
num_normal <- sum(s2c_ordered$sample_type == "Normal") # number of normal samples, used to build the type factor
data_boxplot <- data.frame(transcripts = rep(sig_df_bp$SYMBOL_transcript_id, length(col.order)),
sample_id = rep(col.order, each=length(sig_df_bp$SYMBOL_transcript_id)),
sample_type = gl(2, length(sig_df_bp$SYMBOL_transcript_id)*num_normal, labels = c('Normal','Tumor')),
values = as.vector(RNASeq2_filt_ord_sample))
data_boxplot$transcripts = as.factor(data_boxplot$transcripts)
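A base-graphics sketch of the boxplot built from this kind of data frame is shown below on toy values (the real plot would additionally group or colour by transcript; the numbers here are invented for illustration):

```r
# Toy long-format data: 30 "normal" and 30 "tumour" expression values,
# with the tumour values drawn from a distribution with a higher mean
set.seed(4)
toy_bp <- data.frame(
  values      = log2(1 + c(rexp(30, rate = 0.5), rexp(30, rate = 0.1))),
  sample_type = gl(2, 30, labels = c("Normal", "Tumor")))

# One box per sample type, as in the per-transcript boxplots of the report
boxplot(values ~ sample_type, data = toy_bp,
        ylab = "log2(1 + TPM)", main = "Expression by sample type")
```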
Differential expression analysis revealed that, between the normal and malignant groups, there are 879 altered transcripts with an FDR < 0.01 and |beta| > 3. Moreover, 141 of them belong to the overexpressed group (beta > 3), whereas 738 of the isoforms are underexpressed (beta < -3). The labelled transcripts are the ones whose expression was plotted in the boxplot of the previous image.
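These thresholds are the ones drawn on the volcano plot built from the Wald-test table. The sketch below uses a toy table with sleuth-style columns (`b` for the effect size and `qval` for the FDR); on the real data, `sleuth_table_wt` would take its place.

```r
# Toy stand-in for sleuth_table_wt: 500 transcripts with random effect
# sizes and q-values (qval^3 skews some values towards significance)
set.seed(5)
toy_wt <- data.frame(b = rnorm(500, sd = 2), qval = runif(500)^3)

# Flag transcripts passing both cutoffs: FDR < 0.01 and |beta| > 3
up   <- toy_wt$qval < 0.01 & toy_wt$b >  3
down <- toy_wt$qval < 0.01 & toy_wt$b < -3
cols <- ifelse(up, "red", ifelse(down, "blue", "grey"))

plot(toy_wt$b, -log10(toy_wt$qval), col = cols, pch = 19, cex = 0.5,
     xlab = "beta (effect size)", ylab = "-log10(q-value)",
     main = "Volcano plot (WT results)")
abline(v = c(-3, 3), h = -log10(0.01), lty = 2)  # the cutoffs used above
```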